Estimation of phylogeny using a general Markov model.

The non-homogeneous model of nucleotide substitution proposed by Barry and Hartigan (Stat Sci, 2: 191-210) is the most general model of DNA evolution assuming an independent and identical process at each site. We present a computational solution for this model, and use it to analyse two data sets, each violating one or more of the assumptions of stationarity, homogeneity, and reversibility. The log likelihood values returned by programs based on the F84 model (J Mol Evol, 29: 170-179), the general time reversible model (J Mol Evol, 20: 86-93), and Barry and Hartigan's model are compared to determine the validity of the assumptions made by the first two models. In addition, we present a method for assessing whether sequences have evolved under reversible conditions and discover that this is not so for the two data sets. Finally, we determine the most likely tree under the three models of DNA evolution and compare these with the one favoured by the tests for symmetry.


Introduction
The evolutionary relationship between a set of k homologous sequences of N nucleotides can be represented by a k-leaved bifurcating tree where each leaf node represents a known sequence and each internal node represents an ancestral sequence (which is almost always unknown). The phylogeny of the k sequences can be inferred by using maximum-likelihood methods, which rely on models of nucleotide substitution to infer the most likely tree. Popular phylogenetic methods, like those implemented in PHYLIP (Felsenstein 2004a), PAUP* (Swofford 2002), and Tree-Puzzle (Schmidt et al 2002), use models of nucleotide substitution that assume the evolutionary process is stationary, homogeneous, and reversible. Although a detailed mathematical description of stationarity, homogeneity and reversibility can be found in Ababneh et al (2006a), we will give a brief description of these terms in the context of molecular phylogenetics. Stationarity implies that the marginal probabilities of the four nucleotides remain constant over all the nodes of a given tree. Homogeneity implies that the instantaneous rate matrix (described in eg, Lanave et al 1984, Kishino andHasegawa 1989) is constant over an edge (local homogeneity) or constant over the entire tree (global homogeneity). Reversibility implies that the probability of sampling nucleotide i from the stationary distribution and going to nucleotide j is the same as the probability of sampling nucleotide j from the stationary distribution and going to nucleotide i, where i, j = {A,C,G,T} (Bryant et al 2004). Reversibility, therefore, implies that the process is stationary and permits us to ignore the direction of evolution. The assumptions of stationarity, homogeneity and reversibility are often violated by the data, resulting in an elevated probability of incorrect phylogenetic results (for examples of the complexity of the problem, see Ho and Jermiin 2004;Jermiin et al 2004).
In a landmark article, Barry and Hartigan (1987a) considered a general Markov model for unrooted trees, the assumptions being that the process relating each pair of nodes in the tree is Markovian and the sites are independent and identically distributed. Their model does not make the assumption of stationarity, homogeneity (local or global) or reversibility, so it is more general than the non-stationary but locally homogeneous models considered by Yang (1994) and Yang and Roberts (1995). Barry and Hartigan (1987a) considered a k-taxa tree with 2k-3 Q-matrices, one for each edge of the tree. Each Q-matrix represents the joint probability distribution of nucleotides at the two ends of the associated edge and is a 4 × 4 matrix. Since the sequences at internal nodes are not known, we can only observe the 4 k different combinations of nucleotides at the leaf nodes. These combinations together represent the joint probability distribution of the nucleotides at leaf nodes and can be written as a function of the Q-matrices. Thus, the likelihood of the observed sequences is a function of the set of Q-matrices and can be maximised by determining the maximumlikelihood estimates of the Q-matrices. The algorithm for obtaining the maximum-likelihood estimates was suggested by Barry and Hartigan (1987a) but has not received the attention that other aspects of their paper have, especially calculation of LogDet distance (Lockhart et al 1994;Steel 1994), probably because the large number of parameters was assumed to make the interpretation diffi cult.
We revisit Barry and Hartigan's (1987a) model, and describe it and the estimation algorithm in new notation -we also present a program written in Java TM to implement it. We examine the information that can be obtained from the estimates by applying their algorithm, henceforth referred to as the BH algorithm, to two sets of data, one comprising mitochondrial DNA from seven hominoids, where there is apparent stationarity and homogeneity, and another comprising 16S ribosomal RNA genes from fi ve bacterial genomes, where problems due to lack of stationarity and homogeneity have been noted previously by Galtier and Guoy (1995) as well as Foster (2004). Further, we compare the results obtained from our program with those obtained using simpler models, ie, the F84 model (Kishino and Hasegawa 1989), implemented in DNAML from the PHYLIP program package (Felsenstein 2004a), and the general time reversible (GTR) model (Lanave et al 1984), implemented in PAUP* (Swofford 2002). A likelihood ratio test (Huelsenbeck and Crandall 1997), based on the log likelihood values obtained using the phylogenetic programs, is used to determine whether one or more of the assumptions of stationarity, homogeneity, and reversibility are violated.
The joint probability distribution values for each edge of the tree can be used to determine (a) marginal probabilities at the nodes (internal nodes as well as leaf nodes), and (b) the joint probability distribution of a pair of leaf nodes. The assumption of stationarity can be examined by comparing the marginal probabilities at different leaf nodes (Ababneh et al 2006b). Since Barry and Hartigan's (1987a) method gives estimates of the joint distribution of the two end points of each edge, we can evaluate the hypothesis of reversibility by examining the joint distribution -it should be symmetric if the process is reversible. We do so for the two sets of data mentioned earlier, obtaining the surprising result that the stationary and homogeneous model for the hominoid data appears to be not reversible along some of the edges. Such comparisons seem to be possible for only a part of the tree for the bacterial data since this data set is not stationary.

A General Markov Model on Trees
The general Markov model for phylogenetic trees proposed by Barry and Hartigan (1987a) will be given using a notation that permits a more compact description. Consider an unrooted binary tree, T, (for defi nitions, see Chapter 1 of Semple and Steel (2003)) with l leaves, l -2 internal nodes (or vertices), and 2l -3 edges, for l Ն 0. For convenience, we include l = 1 with 0 internal nodes and 0 edges. Denote leaves by L = {-1, …, -l} and internal nodes by I = {1, …, l -2} (the notation of positives and negatives derives from the merge matrix given by the hierarchical clustering algorithm, hclust, in the S-PLUS or R packages). The set of all vertices is V = L ∪ I. Denote edges by E = {(i, j) : i, j∈V and adjacent}. By inserting a node numbered 0, called a root node, on any edge, and thus increasing the number of nodes and the number of edges by one, the unrooted tree can be converted into a rooted binary tree. If an edge (i, j) of the unrooted tree is deleted then two rooted sub-trees T (i,j) and T (j,i) are formed with roots at i and j, respectively.
The tree will be used to describe a model for evolutionary relationships at a site in the DNA, as in Barry and Hartigan (1987a), by considering the joint distribution of the four bases B = {A,C,G,T} at the leaves. First consider the joint distribution at ends of any edge. Let X i and X j be the values taken by bases at nodes i and j of the edge (i, j). Write for x, y ∈ B, as the joint probability. Note that, since consistency of marginal distributions at internal nodes is required, for i ∈ I, for any j such that (i, j) is an edge.
More generally, let X = (X L , X I ) denote the vector of random variables with X L = (X −1 , …, X −l ) and X I = (X 1 , ..., X l-2 ), and with each Xi taking values in B. Let Q T (x) = P(X = x) be the joint distribution of the bases at the nodes of T. The joint distribution of the bases at the leaves is then denote the sets of leaf nodes and internal nodes, respectively, in T (i, j) , then X (i, j) and X T(i, j) denote the vectors of the values of bases in L (i, j) and T (i, j) , respectively, and Take the model to be Markovian, so that, given (X i , X j ) = (x i , x j ), the conditional distribution of the bases on the leaves of the rooted sub-trees T (i, j) and T ( j, i) , given by deleting edge (i, j), are independent. Under this Markovian model the joint distribution Q L (x L ) can be written as a product of terms involving only Q (i, j) (x i , x j ) and Q i (x i ) for all edges (i, j) and all nodes i. At each site α = 1, …, N the value of a base at the i-th leaf, x iα is known, but at internal nodes the base can take any value in B. Let B iα = {x iα } if i ∈ L, and B if i ∈ I. Then the joint probability distribution of leaf nodes at site α can be expressed as is an indicator function that takes the value 1 if both x i and x j represent leaf nodes, and 0 otherwise. Also, This formula can be applied recursively to the joint distribution on a smaller tree, x until trees with only one edge are reached.
Notice that it is not necessary in this general case to put any restrictions on the model producing the joint distributions on each edge, other than consistency at internal nodes noted earlier.

Estimation
For an unrooted binary tree, T, based on k homologous sequences, each having N sites, Barry and Hartigan (1987a) gave a method of estimating the set of Q (i, j) (x, y) for x, y ∈ B, (i, j) ∈ E by maximizing the log likelihood of the bases at the leaves. Using (4), the log likelihood for an unrooted tree is ( )= ∑ 1 requires equating the derivatives of L + λ (Σ x, y∈B Q (i, j) (x, y) -1) with respect to Q (i, j) (x, y) and λ to zero, which leads to the updating equation , , x In order to minimize computational time, suitable initial values are chosen for all Q (i,j) (x, y). Then (6) is used repeatedly on all edges to update the left hand side using current values for the right hand side until the process converges. Call the values obtained , .

Precise Fit At Leaf Nodes
If i ∈ L, then summing in (6) over y ∈ B giveŝ where N i (x) denotes the number of sites at leaf i that have base x. Thus the maximization leads to a precise fi t at the leaves.

Internal Consistency
If i ∈ I and edges (i, j) and (

Algorithm Implementation
The BH algorithm was implemented in Java (Java TM 2 Platform Standard Edition, Version 1.4.2_03) using an object-oriented approach. The main classes in the program are NewickTreeTraversal, BranchDetails and MaximumAverageLikelihood. The class NewickTreeTraversal reads the unrooted tree in Newick format (Felsenstein 2004b) and constructs a binary tree. Each node is linked to a maximum of three nodes ie one parent node and two descendant nodes. The class BranchDetails stores the joint probability distribution values along each edge of the binary tree. The class Maximu-mAverage Likelihood makes use of the abovementioned classes to compute the log likelihood values and update joint probability distribution values (using formulae described in Section 3) for a user-specifi ed tree. It also generates an output fi le containing the fi nal joint probability distribution values along each edge and the log likelihood value for the entire tree. The joint probability distribution values can be used to compute divergence matrices -a helper program has been written in Java TM for this purpose.
We make use of recursion to compute the joint conditional probability distribution of all the leaf nodes connected to the sub-tree rooted at node i, ie P(L (i, j) |x). The method starts by calculating the joint probability of node i and its immediate descendant nodes. If node i is an internal node, x ∈ B and the joint probability distribution is the sum of joint probability values obtained for different nucleotide values at node i. If a descendant node is an internal node, we consider the sub-tree rooted at the descendant node and compute P(L (i,j) |x). This process is repeated until the leaf nodes are reached.
The initial joint probability distribution values (ie, the Q-values) along the edges of the binary tree are provided by the user. At the end of each iteration, we compare the Q-values before and after updation. If the sum of the square of differences is greater than the user-specifi ed value, the Q-values are updated and the next iteration begins. If none of the edges need to be updated, it implies that convergence has been achieved, and the program terminates.
To improve the program's performance, henceforth referred to as the BH program, for a given data set of matched nucleotide sequences, all unique patterns are identifi ed at the beginning of the program. The log likelihood value is computed only once for each unique pattern and the result is multiplied by the number of times a particular pattern occursthis is a commonly used procedure to reduce the time needed to estimate the likelihood of a tree.
The software will be available for download from http://www.usyd.edu.au/SUBIT/.

Computation of Edge Length
If the nucleotide sites are independent and identically distributed and the underlying model of evolution is stationary, homogeneous and reversible, we can compute edge lengths (Lanave et al 1984;Tavaré 1986;Rodríguez et al 1990) using the formula where δ ij denotes the distance between sequences at nodes i and j in terms of expected number of substitutions per site, t denotes time, π h denotes the h-th element of the diagonal matrix of stationary probabilities, and r hh denotes the h-th diagonal element of the rate matrix. A method for determining asynchronous distances was proposed by Barry and Hartigan (1987b). Although their method can be applied to the general model, if the marginal probabilities at the two ends of an edge are different, the distances are asymmetric (ie, for an edge (i, j), the distance from i to j and from j to i are different). In our paper, we have averaged the distances over the two possible directions of traversal for the purpose of edge length comparison with DNAML. Since the BH algorithm is based on joint probability distributions along the edges and does not require branch length optimization, the averaging of branch lengths does not affect the maximum-likelihood computation.

Variation in log likelihood values
For a given data set of homologous nucleotide sequences, the log likelihood value at convergence depends on the initial set of Q-values. This was observed in both fi ve-taxa and seven-taxa trees irrespective of the tree selected. This indicates the presence of multiple local maxima on the likelihood surface even for the most likely tree. This is an important result because former studies of the problem of multiple maxima on the log likelihood surface have assumed stationary, homogeneous, and reversible models of evolution. Chor et al (2000) showed that even for simple models of evolution, multiple maxima are possible while Rogers and Swofford (1999) used simulation to show that the best tree is unlikely to have multiple maxima.
For the two data sets analysed below, convergence to a local maximum, different from the global maximum, was observed only if the Q-values chosen were extreme; for example, a Q-matrix with all the joint probabilities being equal or a Q-matrix with diagonal elements much smaller than off-diagonal elements. From the Q-matrices that converged to the global maximum, we randomly selected one with a value of 1/8 along the main diagonal and 1/24 elsewhere for the computation of log likelihood values mentioned in section 5.

Application to two sets of homologous sequences
Under the Markovian model of DNA evolution, the process of evolution may or may not be stationary and homogeneous. We consider both cases and argue that the general model of DNA evolution proposed by Barry and Hartigan (1987a) is useful in both cases. For each data set, we (i) used three matched-pairs tests of homogeneity (Bowker 1948;Stuart 1955;Ababneh et al 2006b) to determine whether the sequences could be assumed to have evolved under stationary and homogeneous conditions (a prerequisite for using most phylogenetics methods); (ii) determined the degrees of freedom needed in order to compare phylogenetic results using likelihood-ratio tests; (iii) estimated and compared the trees; and (iv) conducted a comparison of edge lengths, divergence matrices and substitutional biases. We show that Barry and Hartigan's (1987a) method provides a useful reference point for choosing appropriate models of substitution, and the means for assessing whether the evolutionary process is reversible; such a method appears to be unavailable in the current literature.

Assessment of phylogenetic assumptions
The alignments of fi rst, second, and third codon sites were examined independently using the matched-pairs tests of symmetry (Bowker 1948), marginal symmetry (Stuart 1955), and internal symmetry (Ababneh et al 2006b). Given that each of these tests involve multiple comparisons of related sequences, it was necessary to interpret the p-values with caution. The matched-pairs tests of homogeneity produced p-values in the range of 1.000 to 0.024 for the fi rst and second codon sites (Tables 1 and 2), and in the range of 0.996 to 0.006 for the third codon sites (Table 3). For the 21 pairwise comparisons, only 1 p-value was observed to be lower than 0.05 for the fi rst and second codon sites whereas approximately one-fourth of the p-values for the third codon site were found to be lower than 0.05. These results are consistent with evolution under stationary and homogeneous conditions for first and second codon sites but not for third codon sites. Interestingly, all the low p-values observed for third codon sites involved comparisons with Orangutan, indicating real differences.
Given that the alignment of third codon sites provides some evidence against the evolutionary process being stationary and homogeneous, the following phylogenetic analyses were done using an alignment of fi rst and second codon sites only. Assuming stationarity, homogeneity, and reversibility, the GTR model, considered over the entire tree, would be appropriate for inferring the most likely tree. If we constrain the assumptions further by assuming that the six rate parameters in the GTR model can be reduced to two rate parameters (ie transitions and transversions), then the F84 (Kishino and Hasegawa 1989) model would be sufficient to predict the most likely tree. We determined the most likely tree by using DNAML from the PHYLIP program package (Felsenstein 2004a), PAUP* (Swofford 2002), and the BH program.

Calculating the degrees of freedom
In order to compare the fi t of the alignment to trees inferred using DNAML (Felsenstein 2004a), PAUP* (Swofford 2002), and the BH program, the degrees of freedom are needed for each estimate. Both DNAML and PAUP* consider a stationary, homogeneous, and reversible process, so a single rate matrix applies to the entire tree, and the degrees of freedom is the sum of the number of edges and the number of parameters in the rate matrix. The F84 model has fi ve parameters in the rate matrix and the GTR model has nine parameters in the rate matrix. However, in order to obtain the edge lengths in terms of the expected number of substitutions per site, the expected rate of substitution is set to 1 (Yang and Roberts 1995), so the number of free parameters in the F84 and GTR models is reduced by one. Accordingly, for a seven taxa tree, the degrees of freedom is 15 for results obtained using the F84 model and 19 for results obtained using the GTR model; the difference in the degrees of freedom for these two models is four. The model proposed by Barry and Hartigan (1987a), which does not assume stationarity, homogeneity or reversibility, has nine degrees of freedom along each edge and three degrees of freedom at each node. Thus, the degrees of freedom for a seven taxa tree inferred using the BH algorithm is 135, and the difference in the degrees of freedom is 116 and 120, respectively, for the GTR and F84 models (in relation to trees inferred using the BH algorithm).

Inferring and comparing the trees
The most likely tree obtained using DNAML and PAUP* is shown in Figure 1. As the difference in log likelihood (lnL) obtained using these two programs is 9.2 (−3635.445 for PAUP* and −3644.645 for DNAML), 2 × lnL = 18.4. Under the hypothesis that the GTR model can be reduced to the F84 model for these data, we might expect the difference to be distributed approximately as a chi-squared variate with four degrees of freedom. Thus, for the hominoid data set, there is evidence that the F84 model is not suffi cient to explain the evolutionary process.
To determine the most likely tree inferred by the BH program, we performed an exhaustive search of tree-space. The Newick representation of all the 945 possible unrooted binary trees was generated using the TreeGen program (Wolf et al 2000). The BH program used these trees as input to generate the log likelihood value for each tree. In all cases, the Q-matrices showed internal consistency and a precise fi t at the leaf nodes. The three most likely trees and their log likelihood values are shown in Table 4. The most likely tree inferred by using the BH algorithm is the same as those inferred by DNAML and PAUP*. Interestingly, the second most likely tree in Table 4 is the one that is commonly thought to represent the hominoid evolution (see eg Raaum et al 2005) whereas the third most likely tree in Table 4 is the one inferred by Hudelot et al (2003). We compared the trees in Table 4 using the SH-test by Shimodaira and Hasegawa (1999) and the approximately unbiased (AU) test by Shimodaira (2002), which are implemented in CONSEL (Shimodaira and Hasegawa 2001). The results in Table 5 show that the two most likely trees are statistically indistinguishable, possibly due to the sequences being too short (1206 bp) to rule out stochastic error, which can interact with systematic errors and prevent identifi cation of the correct tree.
For the most likely tree presented in Table 4, the log likelihood values returned by PAUP* and BH are −3635.445 and −3540.684, respectively, so 2 × lnL = 189.5226, which is large compared to a chi-squared distribution with 116 degrees of freedom. Since the large difference in log likelihood cannot be explained by the difference in the degrees of freedom of the two models, the likelihood ratio test suggests that one or more of the three assumptions of stationarity, homogeneity and reversibility are violated by the hominoid data set. Given the results from the matched-pairs tests of symmetry, marginal symmetry and internal symmetry, there is reason to suspect that the assumption of reversibility is violated.
Given that there may be doubt about the accuracy of the asymptotic approximation in cases like this, we verifi ed the results by using parametric bootstrapping. The parameters values for the GTR model were estimated on the most likely tree using the HyPhy program (Kosakovsky Pond et al 2005). One thousand alignments of 1206 nucleotides were generated on the parameter values and the most likely tree using the Seq-Gen program (Rambaut and Grassly 1997). For each alignment, we estimated the log likelihood of the data, given the tree and the GTR model or given the tree and the BH model. The values for 2 × lnL ranged from 68.29 to 152.60 with a mean of 109.90 and a median of 109.60. Approximately 71% of the values lay between 116 ± 15.23, where the latter value corresponds to the standard deviation of a χ 2 116 . This shows that the large difference in log likelihood values returned by PAUP* and BH program for the hominoid data set is signifi cant.

Tree-dependent comparison of edge lengths
The joint probability distribution values returned by BH were used to obtain edge lengths by averaging   over the two possible directions of traversal. If the process is stationary, homogeneous, and reversible, the values obtained over the two directions of traversal would be the same. However, the alignment of fi rst and second codon sites of the hominoid data is not consistent with evolution under stationary, homogeneous and reversible conditions, so the averaged edge lengths will only provide a rough estimate of the expected number of substitutions along each edge. Nonetheless, the edge lengths obtained using DNAML and BH are similar (Table 6).

Evaluation of Divergence Matrices and Substitution Biases
Given two neighbouring edges (i, j) and (k, j), the joint probability distribution for the pair (i, k) can be estinated as A generalisation of this formula, obtained by summing over all internal nodes in the path from node l to node m, and multiplying by N gives the estimated divergence matrix for any pair (l, m). For the hominoid data set, the divergence matrices computed using the estimated joint probability distribution values along each edge of the tree are close to the observed divergence matrices. In Table 7 we give the estimated and observed divergence matrix values for the Macaque-Bonobo pair.
The values indicate that the general model of DNA evolution approximates the actual process of evolution quite well. Table 6. Comparison of edge lengths obtained using BH and PHYLIP for the hominoid tree ((((((Ptro,Ppan), Ggor), Hsap), Ppyg), Hlar), Msyl). Refer Figure 1 for an explanation of node numbers. This is typical of the fi t of divergence matrices for all pairs of leaf nodes. To compare the differences between observed and estimated divergence matrices within a statistical context, we calculated a chi-squared test statistic using the formula

Edge
where O denotes the observed divergence matrix, and E denotes the estimated divergence matrix. This goodness-of-fi t index has values in the range 0.07 (Chimpanzee-Bonobo pair) to 8.19 (Human-Macaque pair). Since the marginal probabilities for each pair of leaf nodes are known, the degrees of freedom for the above-mentioned test statistic cannot exceed nine. We would obtain exactly nine degrees of freedom if E was known precisely apart from the marginal probabilities.
The divergence matrices computed in section 5.1.1 were found to be symmetric for all comparisons between the leaf nodes (Tables 1 and 2). However, when we looked at the estimated joint probability distributions for individual edges, we observed a distinct lack of symmetry.
We used the Bowker (1948) and Stuart (1955) test statistics using NQ (x, y), for x, y = 1, …, 4, as pseudo-observations. The resulting pseudop-values are provided in Table 8. Although these values are based on Q-matrices estimated by the BH program and therefore are not the true p-values, they are useful indices for measuring symmetry and provide a clear indication of lack of symmetry for internal node to leaf node edges.
An examination of the corresponding estimated joint probabilities shows a bias for A → G and C → T substitutions over G → A and T → C substitutions for ancestral to leaf node transitions (Table 9). Since a reversible Markov process would result in a symmetric joint distribution of the end points of an edge, there is evidence that the process in fact is not reversible. This observation is consistent with the earlier research on mammalian genes. Wise et al (1998) found a directional bias in nucleotide substitutions in the human mitochondrial genome whereas Eyre-Walker (1999), using a stationary and infi nitesites model, found that a base composition bias in mammals cannot be explained by the mutation bias hypothesis. Subsequently, assuming a stationary model, Smith and Eyre-Walker (2001) found that the synonymous codon bias in humans cannot be explained by the mutation bias hypothesis whereas Galtier et al (2001) have argued that the evolution of GC-content in mammals is explained by a biased gene conversion hypothesis.

Assessment of phylogenetic assumptions
The matched-pairs tests of symmetry (Bowker 1948), marginal symmetry (Stuart 1955) and internal symmetry (Ababneh et al 2006b) show that the sequences are highly unlikely to have evolved under stationary and homogeneous conditions (Table 10). We analyzed this data set, in a similar manner to the hominoid data set, by (i) considering the maximum-likelihood results obtained using different models of DNA evolution; (ii) comparing the trees using the goodness-of-fi t measure described in section 5.1.5; and (iii) comparing the trees based on tests of symmetry along individual edges. However, we did not perform the likelihood ratio test as the difference in log likelihood values obtained using the BH program and PAUP* is extremely large and could not have arisen by chance.

Inferring the trees
Using the BH program, log likelihood values were found to fall in the range from −4387.239 to −4289.511. The most likely tree obtained is shown in Figure 2 and henceforth referred to as tree #1. Like the other data set, the marginal probabilities at the internal nodes are consistent and the marginal probabilities at the leaf nodes fi t the observed data precisely. The most likely tree inferred using DNAML and PAUP* is shown in Figure 3 and henceforth referred to as tree #2. Even the BH program returned this particular tree as the second most likely with a log likelihood value of −4296.22. By contrast, the log likelihood value returned by PAUP* for the same tree is −4401.41722. The large difference in log likelihood values returned by BH and PAUP* is expected because the evolutionary process is highly unlikely to have been stationary and homogeneous (Table 10) and, hence, cannot be approximated by the GTR model of DNA evolution. For the bacterial data set, Foster (2004) considered rooted trees with locally homogeneous processes acting on them and concluded that two different rate matrices were suffi cient to describe the evolutionary process. However, Foster's (2004) method requires the frequency parameters to be known in advance and he chose two sets of frequency parameters (and hence two different rate matrices) based on the observation that the marginal probabilites at leaf nodes could be grouped into two different categories. For a large tree, intrepreting the closeness of frequency parameters might prove to be a diffi cult task. Another possible limitation of Foster's (2004) approach is that the change in rate matrices owing to changes in parameters other than the frequency parameters are not considered. For example, the GTR model has fi ve free parameters in addition to the frequency parameters and a   change in those five parameters would also change the rate matrix. This limitation exists even for the N1 and N2 models proposed by Yang and Roberts (1995). In contrast, the BH model uses the available data to automatically adjust the parameters from edge to edge and can be used to identify the portions of the tree where the rate matrix is homogeneous.

Comparison of edge lengths
The edge lengths calculated using BH can be used to obtain the distance between a pair of leaf nodes. For example, the distance from Thermus to Thermotoga for tree #1 can be obtained by adding the distances along the edges (Thermus, Node-3), (Node-3, Node2), (Node-2, Node-4) and (Node-4, Thermotoga). For tree #1 we found that the distances between Deino-coccus-Thermus (0.191) and Thermus-Thermotoga (0.194) are quite close (Table 11a). Since the edge lengths are only an approximation, it is possible that phylogenetic programs that assume the process to be stationary, homogeneous and reversible would infer a tree favouring Thermus closer to the Thermotoga-Aquifex pair (tree #2), an observation supported by the output from DNAML and PAUP*, and by the fact that the five sequences can be divided into two groups according to their GC content, where the group containing GC-rich sequences comprises Aquifex, Thermotoga and Thermus and the other group comprises Bacillus and Deinococcus (Figure 3). Although the evolutionary process is not stationary and homogeneous for the bacterial data set, the edge lengths obtained using BH for tree #1 Figure 3. The most likely bacterial tree inferred using GTR (and F84) models (tree #2). The GC content of the sequences is included (based on Table 14a).  and tree #2 are within the confi dence interval specifi ed by DNAML (Table 11). The closeness of the edge lengths returned by these two programs for this data set shows that averaging of edge lengths might be considered an adequate approximation.

Evaluation of Divergence Matrices and Substitution Biases
The closeness of estimated and observed divergence matrices can be measured using the goodness-of-fi t index described in section 5.1.5. For tree #1, the estimated and observed divergence matrices are quite close to one another except for the pair Bacillus-Deinococcus (Table 12).
In contrast, the divergence matrices for tree #2 has a low goodness-of-fit index value for the Bacillus-Deinococcus pair but high indices for the Bacillus-Thermotoga and the Bacillus-Aquifex pairs (Table 13). Since signifi cant differences exist between observed and estimated divergence matrices for both tree topologies, the evidence is insufficient to favour one particular tree over the other.
Since the marginal probabilities at the leaf nodes Thermus, Thermotoga and Aquifex and the internal nodes are quite close (Table 14a, Table 14b and Table 14c), and the marginal probabilities of Bacillus and Deinococcus (Table 14a) are close to one another, but different from those of the other taxa, tree #2 is the simplest model agreeing with the tests of symmetry. However, the BH program returned a slightly higher log likelihood value of −4289.511 for tree #1 compared to a value of −4296.22 for tree #2 and some authors (see eg Gupta 2000) have favoured the close relationship of Thermus and Deinococcus (as in tree #1).
We performed tests of symmetry along each edge for tree #1 and tree #2 using NQ(x,y), for x, y = 1, …, 4, as pseudo-observations (refer section 5.1.5) and the resulting pseudo-p-values are described in Table 15. As noted earlier, the evolutionary process is not stationary for the fi vetaxa tree and Table 15b suggests a stationary process for the sub-tree containing leaf nodes Thermus, Thermotoga and Aquifex, distinct from the process that gave rise to Bacillus and Deinococcus.
An examination of the estimated joint probabilities for edges in tree #1 shows that there are large biases in the patterns of substitutions along many of the edges. For example, in the case concerning the terminal edge to Deinococcus, there is a strong bias in A → C, A → G, T → C, and T → G substitutions over C → A, G → A, C → T, and G → T substitutions (Table 16).

Performance
We ran all programs on a dual processor 1.25 GHz PowerPC G4 with 512 MB of DDR SDRAM and Mac OS X version 10.3.9 as the operating system. For a five-taxa or seven-taxa tree with 1206 nucleotide sites per taxon, our program took approximately one second to compute the likelihood for a single tree. For a 10-taxa tree having 1202 sites per taxon, it took 4 seconds to compute the likelihood for a single tree. To compare the performance of BH program with PAUP* (Swofford 2002), we considered seven hominoid species (refer Section 5.1) and computed the likelihood of each of the 945 unrooted trees using the two programs. In PAUP* (Swofford 2002), we selected the GTR model (Lanave et al 1984) and determined the maximum-likelihood estimates of all the eight free parameters. We found that the BH program took 33% longer than PAUP* (Swofford 2002) to compute the likelihood of the 945 trees.

Conclusion
By modifying the GTR model such that it still has the constraints of stationarity and homogeneity but not reversibility, we can obtain the general 12parameter model, where the 12-paramaters correspond to elements in the rate matrix (excluding the stationary probabilities). An even more general model can be obtained by considering the 12-parameter model over each edge of the tree. The only assumption made by such a model is that each edge (i, j) has a Markovian process defi ned over it; such a general non-homogeneous model was proposed by Barry and Hartigan (1987a).
We have implemented the algorithm by Barry and Hartigan (1987a) and used it to analyse two data sets − a hominoid data set with apparent stationarity and homogeneity, and a bacterial data set with apparent violation of these assumptions. We have also compared the results obtained using two different approaches and found that if the tests of symmetry indicate the evolutionary process to be stationary and homogeneous, then the most likely trees inferred using the F84 model, the GTR model and the general non-homogeneous model are the same. However, the log likelihood values obtained under the GTR model differ signifi cantly from the general non-homogeneous model, providing evidence that the evolutionary process violates some of the assumptions made by the GTR model. Although the assumptions of stationarity and homogeneity can be assessed using tests of symmetry, there is no test available for checking the reversibility condition. However, values of the joint probability distribution returned by the BH program can be examined for reversibility; a symmetric Q-matrix for edge (i, j) corresponds to a reversible process along that edge.
For a stationary and homogeneous process, the edge lengths obtained from the general nonhomogeneous model are within the confi dence interval specifi ed by the F84 model and, in this respect, there is no obvious gain in using the general nonhomogeneous model for the hominoid data set.
If the process is not stationary and homogeneous, as in the case of bacterial data set, then the preferred tree obtained using the F84 and GTR models may differ from that obtained using the general nonhomogeneous model. The tests of symmetry using divergence matrices of leaf nodes favour the tree obtained using the F84/GTR model. However, these tests ignore the possibility of varying evolutionary rates along different edges of the tree. Similarly, the F84 and GTR models of nucleotide substitution assume a constant rate of substitution for the entire tree and are less likely to fi nd instances of convergent evolution. In contrast, the general nonhomogeneous model incorporates rate heterogeneity along each edge and, therefore, is more likely to fi nd instances of convergent evolution.
Finally, we have shown that the trees obtained using the general non-homogeneous model can be compared using a goodness-of-fi t index that measures the closeness of expected and observed divergence matrices. For both the hominoid and bacterial data sets, the estimated joint probability distribution matrices were found to be asymmetric for some of the edges connecting internal nodes to lead nodes. This bias in substitution implies a lack of reversibility, so it may be inappropriate to analyse the sequences using phylogenetic models that assume stationary, homogeneous and reversible conditions.

Future work
Our implementation of Barry and Hartigan's (1987a) algorithm assumes that the nucleotide sites are independent and identically distributed. However, to make their algorithm more general, we need to incorporate rate heterogeneity among sites. Felsenstein and Churchill (1996) proposed the inclusion of a hidden Markov model to allow for rate variations among sites; perhaps a similar model could be used in the context of the BH algorithm.
Secondly, our implementation of Barry and Hartigan's (1987a) algorithm calculates the maximum-likelihood for a user-specifi ed tree. It does not search the treespace for the most likely tree and therefore is limited to analysis of a small number of taxa (k Յ 7). Although searching through the entire tree space is an NP-hard problem, the computation time can be reduced by using a search strategy such as branch and bound (Hendy and Penny 1982) or other heuristic methods. We have modifi ed Lewis' genetic algorithm (1998) to search through the tree space but the processing time is far greater than that required by PHYLIP (Felsenstein 2004a) or PAUP* (Swofford 2002). One possible way of reducing the processing time would be to implement a parallel version of the BH algorithm. Some of the other heuristic methods that might be useful are tree-fusing (Goloboff 1999) and simulated annealing (Metropolis et al 1953;Salter and Pearl 2001). We also need to search the likelihood surface more exhaustively and, if possible, identify the Q-values that converge to a global maximum. Finally, we need to understand better the statistical properties associated with assessment of symmetry of joint probability distribution matrices.

Acknowledgment
This research was partly funded by a Discovery Grant (DP0453173) from the Australian Research Council. The hospitality of Patrick Forterre and the Pasteur Institute is gratefully acknowledged by LSJ. This is research paper #015 from SUBIT.